
pipelines: reorganize around ingest mode #59

Merged
nate-smalls-s1 merged 5 commits into Sentinel-One:main from natesmalley:reorg-pipelines
Apr 27, 2026
@natesmalley
Contributor

Summary

Reorganizes pipelines/ so contributors immediately see ingest mode (push vs pull),
introduces ingest_mode and auth_type fields in pipeline metadata.yaml, and
removes an orphan PAN-OS serializer that is functionally subsumed by an existing
community transform.

What changed

  • New tree shape under pipelines/:

    • push/syslog/<vendor>/<product>/
    • push/hec/<vendor>/<product>/
    • pull/api/<vendor>/<product>/
    • pull/object_store/<vendor>/<product>/
    • community/transform_ocsf/<vendor>/<product>/ (already existed; layout retained)

    Each new leaf has a .README.md documenting what belongs there and the required
    metadata.yaml fields.

  • Removed the orphan PAN-OS serializer at
    pipelines/community/serializers/Palo Alto Networks/serializer.lua.
    It is functionally subsumed by pipelines/community/transform_ocsf/paloalto_logs/,
    which targets the same OCSF class (Network Activity, class_uid=4001), is signed
    off with 100% required-field coverage, and handles a broader range of log types.
    The now-empty serializers/ umbrella is removed alongside it.

  • New metadata fields added to the pipeline metadata.yaml schema:
    ingest_mode and auth_type. Documented in pipelines/community/README.md and
    the top-level README.md. Schema applies to new pipelines added after this PR;
    existing entries in transform_ocsf/ will be backfilled in a follow-up.
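
The tree shape listed above can be expressed as a small path check. This is a minimal sketch and not part of this PR: the function name `classify_pipeline_path` and the bucket labels it returns are hypothetical, and it assumes only the five layouts named in this description.

```python
from pathlib import PurePosixPath
from typing import Optional

# Bucket prefixes copied from the tree shape in this PR description.
BUCKETS = (
    ("push", "syslog"),
    ("push", "hec"),
    ("pull", "api"),
    ("pull", "object_store"),
    ("community", "transform_ocsf"),
)


def classify_pipeline_path(path: str) -> Optional[str]:
    """Return the bucket (e.g. "push/syslog") for a pipeline path, else None.

    Expects pipelines/<mode>/<transport>/<vendor>/<product>, optionally
    followed by files inside the leaf directory.
    """
    parts = PurePosixPath(path).parts
    # Need at least: pipelines, mode, transport, vendor, product.
    if len(parts) < 5 or parts[0] != "pipelines":
        return None
    if (parts[1], parts[2]) in BUCKETS:
        return f"{parts[1]}/{parts[2]}"
    return None
```

A check like this could run in CI to flag pipelines added outside the documented buckets; the vendor and product components are not validated here.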

What is NOT in this PR (intentional)

  • No deletions of existing community pipelines.
  • No backfill of new fields onto existing transform_ocsf/ entries.
  • No naming-consistency cleanup of the paloalto_* cluster.
  • No fix for transform_ocsf/palo_alto_networks_firewall/ (graded F /
    analyzer_limit / 0% required_field_coverage_pct) — flagged as a follow-up.

Test plan

  • CI passes (CodeQL, secret scanning, contributor automation)
  • git log --stat shows new push/ and pull/ directories created with
    leaf .README.md files, removal of the orphan PAN-OS serializer, and
    removal of the now-empty serializers/ umbrella
  • pipelines/community/transform_ocsf/paloalto_logs/serializer.lua is unchanged
    and remains the canonical PAN-OS Network Activity transform (still graded
    signed_off, 100% required-field coverage)
  • pipelines/community/README.md renders cleanly on github.com
  • No broken links in top-level README.md
  • Manual smoke test: re-import transform_ocsf/paloalto_logs/serializer.lua
    against a PAN-OS event sample and confirm OCSF Network Activity output
    (class_uid=4001) matches the prior baseline

Nate Smalley and others added 5 commits April 26, 2026 18:21
…ructure

Adds scaffolding for ingest-mode-first organization of community pipelines.
Each leaf README documents what belongs there and the required metadata.yaml
fields (ingest_mode, auth_type).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…f/paloalto_logs)

The serializer at pipelines/community/serializers/Palo Alto Networks/serializer.lua
covered only TRAFFIC and THREAT log types and produced OCSF Network Activity
(class_uid=4001) -- a strict subset of the existing community transform at
pipelines/community/transform_ocsf/paloalto_logs/, which is signed off, has
100% required-field coverage, and handles the same OCSF class plus a broader
range of log types. Removing the orphan and the now-empty serializers/ umbrella.

Out-of-scope follow-ups:
- pipelines/community/transform_ocsf/palo_alto_networks_firewall/ is graded F
  (analyzer_limit, 0% required_field_coverage_pct) -- needs a fix or removal.
- paloalto_logs/ vs paloalto_alternate_logs/ may be consolidatable; the latter
  appears to differ only in accepting variant field names (logtype/log_type/type).
- Naming consistency across the paloalto_* cluster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds pipelines/community/README.md establishing:

- The directory tree (push/{syslog,hec}, pull/{api,object_store},
  community/transform_ocsf/)
- Required metadata.yaml fields including ingest_mode and auth_type enums
- Naming conventions (lowercase, underscored, no spaces)

The new schema applies to new pipelines added after this PR; existing
transform_ocsf/ entries will be backfilled in a follow-up.
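
The naming convention above (lowercase, underscored, no spaces) could be checked mechanically. The sketch below is one possible reading of that rule, not the repository's actual validator: the regular expression and the helper name `is_valid_component` are assumptions.

```python
import re

# Assumed reading of "lowercase, underscored, no spaces":
# lowercase letters and digits, with single underscores between words.
NAME_RE = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)*")


def is_valid_component(name: str) -> bool:
    """Return True if a directory name follows the assumed convention."""
    return NAME_RE.fullmatch(name) is not None
```

Under this reading, `object_store` and `paloalto_logs` pass, while the removed `Palo Alto Networks` directory name would not.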

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates the Repository layout block to reflect push/{syslog,hec} and
pull/{api,object_store}, replaces the Pipelines Installation Guide with a
shorter Pipelines section pointing at the new structure, and updates the
Metadata requirements appendix with the new ingest_mode and auth_type fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nate-smalls-s1 nate-smalls-s1 merged commit 79a947d into Sentinel-One:main Apr 27, 2026
2 checks passed
nate-smalls-s1 pushed a commit that referenced this pull request Apr 27, 2026
Adds the new metadata fields introduced by #59 to all 129 existing
transform_ocsf/ pipeline metadata.yaml files. The fields are inserted
immediately after the existing ingestion_method line in each file. No
serializer logic, no pipeline JSON, no other metadata changed.

Values were derived per entry by combining:

1. Bound parser metadata (parsers/community/<source_name>/metadata.yaml)
   when the parser declares format=syslog/CEF/RFC/w3c/custom-syslog or
   ingestion_method containing "Syslog" or "HEC" -- the parser is
   authoritative when its declaration is unambiguous.

2. Vendor and product knowledge for the ~90 entries where the parser
   metadata is unclear (gron format with "streaming" or "unknown"
   ingestion_method, or no parser binding at all). Examples:
   - Cisco network kit (firewalls, ASA, Meraki, ISE, etc.) -> Syslog
   - Microsoft 365 / Entra / Defender management surfaces -> API Call (OAuth)
   - AWS managed services delivering to S3 (CloudTrail, ELB, Route53
     Resolver, GuardDuty export, VPC flow) -> Other - {object store with
     SQS notifications} (IAM Role)
   - Azure Event Hub-delivered streams (signin, defender email) ->
     Other - {Azure Event Hub stream (AMQP/Kafka protocol)} (OAuth)
   - SaaS REST APIs (Okta, Snyk, Wiz, Tenable, Mimecast, Netskope,
     Proofpoint, GitHub, Google Workspace, Cloudflare, etc.) -> API Call
     with the vendor's typical auth (Bearer Token, API Key & Secret,
     or OAuth)
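
Rule 1 above can be sketched as a predicate over a parser's metadata. This is an illustration under stated assumptions, not the script used for the backfill: the function name is hypothetical, the metadata is assumed to be a flat mapping with `format` and `ingestion_method` keys, format matching is assumed to be case-insensitive, and the "Syslog" / "HEC" keywords are matched exactly as written in the rule.

```python
from typing import Mapping

# Formats and keywords named in rule 1 of this commit message.
AUTHORITATIVE_FORMATS = {"syslog", "cef", "rfc", "w3c", "custom-syslog"}
AUTHORITATIVE_KEYWORDS = ("Syslog", "HEC")


def parser_is_authoritative(parser_meta: Mapping[str, str]) -> bool:
    """Return True when the bound parser's declaration is unambiguous."""
    fmt = parser_meta.get("format", "").strip().lower()
    method = parser_meta.get("ingestion_method", "")
    if fmt in AUTHORITATIVE_FORMATS:
        return True
    # Case-sensitive on purpose, so "HEC" does not match inside other words.
    return any(keyword in method for keyword in AUTHORITATIVE_KEYWORDS)
```

Entries for which this returns False would fall through to rule 2, where the value comes from vendor and product knowledge rather than from the parser.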

Confidence per entry is recorded in
.reorg-prep/inventory/transform_ocsf_classifications.tsv as one of
high (103), medium (17), or low (9). Low-confidence entries are
genuinely generic placeholders (json_generic_logs, sample_test_logs,
microservice_tracing_logs, etc.) where a more specific value is not
derivable; they use Other - {Explain: ...} with the reason inline.

palo_alto_networks_firewall/ is intentionally not modified because it is
being removed in PR #60 (open).

Resulting distribution:
  Syslog                                              56
  API Call                                            39
  Other - {object store / Event Hub / agent / etc.}   34

Auth distribution:
  N/A (syslog / file-based / generic)                 75
  API Key & Secret                                    20
  OAuth                                               18
  IAM Role                                             8
  Bearer Token                                         7
  Other (Kafka SASL)                                   1
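
As a quick consistency check, each breakdown above should sum to the 129 backfilled files. The sketch below only re-adds the numbers quoted in this commit message; the dictionary keys are shortened labels and the function name is hypothetical.

```python
from typing import Dict

EXPECTED_TOTAL = 129  # existing transform_ocsf/ entries, per this commit message

# Counts copied from the distributions and confidence levels above.
BREAKDOWNS: Dict[str, Dict[str, int]] = {
    "ingest": {"Syslog": 56, "API Call": 39, "Other": 34},
    "auth": {
        "N/A": 75,
        "API Key & Secret": 20,
        "OAuth": 18,
        "IAM Role": 8,
        "Bearer Token": 7,
        "Other (Kafka SASL)": 1,
    },
    "confidence": {"high": 103, "medium": 17, "low": 9},
}


def breakdown_totals() -> Dict[str, int]:
    """Sum each breakdown so it can be compared with EXPECTED_TOTAL."""
    return {name: sum(counts.values()) for name, counts in BREAKDOWNS.items()}
```

All three breakdowns add up to `EXPECTED_TOTAL`.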

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nate-smalls-s1 pushed a commit that referenced this pull request Apr 27, 2026
Moves 91 community pipeline directories from
pipelines/community/transform_ocsf/<name>/ into the ingest-mode-first
taxonomy introduced in #59:

  pipelines/push/syslog/<vendor>/<product>/      57 entries
  pipelines/pull/api/<vendor>/<product>/         29 entries
  pipelines/pull/object_store/<vendor>/<product>/  5 entries

The mode bucket is determined by each entry's ingest_mode field (backfilled
in #61). The vendor and product split is derived per entry from the
upstream parser binding and vendor/product convention; collisions across
the cluster (Cisco Meraki, Fortinet, Cloudflare, Zscaler, Microsoft, etc.)
are disambiguated with explicit product-name overrides documented in
.reorg-prep/inventory/transform_ocsf_migration_plan.tsv.

History is preserved on every entry (git mv).
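
A move like this can be driven from a plan file. The sketch below is hypothetical: the column names (`name`, `bucket`, `vendor`, `product`) are assumptions, since the layout of transform_ocsf_migration_plan.tsv is not shown here, and the function only builds `git mv` argument lists without executing anything.

```python
import csv
import io
from typing import List

SRC_ROOT = "pipelines/community/transform_ocsf"


def plan_to_git_mv(tsv_text: str) -> List[List[str]]:
    """Turn plan rows into `git mv` argument lists (nothing is executed).

    Assumed columns (hypothetical): name, bucket, vendor, product.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    commands = []
    for row in reader:
        dest = f"pipelines/{row['bucket']}/{row['vendor']}/{row['product']}"
        commands.append(["git", "mv", f"{SRC_ROOT}/{row['name']}", dest])
    return commands
```

Each list could be passed to `subprocess.run(cmd, check=True)`; the destination's parent directory would need to exist first.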

What stays in pipelines/community/transform_ocsf/ (15 entries):
  - Generic / template / unknown-vendor entries: agent_metrics_logs,
    generic_access_logs, inngate_gateway_logs, json_generic_logs,
    json_nested_kv_logs, leef_template_logs, log4shell_detection_logs,
    mail_server_logs, microservice_tracing_logs, sample_test_logs,
    spam_detection_logs, sql_database_logs, syslog_space_delimited_logs,
    vpc_logs, jruby_application_logs.

What is NOT in this PR (intentional):
  - 23 entries scheduled for removal in #62 (broken-legacy, 7) and #63
    (first-party ingestion paths, 16) are NOT moved; they remain in
    transform_ocsf/ until those PRs merge. This PR has no overlap or
    conflict with #62/#63 -- merge order does not matter.
  - No serializer logic, no metadata.yaml content, and no pipeline JSON
    content was modified. Every change is a directory rename.
  - No naming-consistency cleanup (e.g., paloalto_* -> palo_alto/*) is
    applied yet; that is a separate follow-up.
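
The claim that every change is a directory rename can be checked from `git diff --name-status -M` output, where a rename with unchanged content is reported with status `R100`. The helper below is a minimal sketch with a hypothetical name; it parses that output as text and does not call git itself.

```python
from typing import List


def non_rename_entries(name_status: str) -> List[str]:
    """Return lines of `git diff --name-status -M` output that are not
    100%-similarity renames."""
    offenders = []
    for line in name_status.splitlines():
        if not line.strip():
            continue
        # The status code is the first tab-separated field, e.g. "R100" or "M".
        status = line.split("\t", 1)[0]
        if status != "R100":
            offenders.append(line)
    return offenders
```

An empty result would support the rename-only claim for the compared range.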

The pipelines/push/syslog/, pipelines/pull/api/, and
pipelines/pull/object_store/ directories are now populated, so the empty
scaffolding from #59 has content. pipelines/push/hec/ receives no entries
in this move.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>